Merge for tweaks that address BOM-table problems and step over psql-commands. #13
Conversation
…l as regular files.
pgsanity/pgsanity.py
Outdated
bom_present = check_for_bom(nose)
sql_string = sql_string[len(nose):] if bom_present else sql_string
success, msg = check_string(sql_string.decode("utf-8"))
# ^ The above call to decode() is safe for both ASCII and UTF-8 data.
This is a bit nit-picky, but I think I'd rather see all of this BOM stuff move into check_for_bom(), which probably means that check_for_bom() would become remove_bom_if_exists(). It would take the full SQL string and return the full SQL string (with the BOM removed if it exists). I think that encapsulates the BOM handling a bit more.
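A minimal sketch of that shape (assuming the string is still raw bytes at this point, as in the snippet above; codecs.BOM_UTF8 is just the standard-library constant for the three BOM bytes) might look like this:

```python
import codecs

def remove_bom_if_exists(sql_string):
    """Return sql_string unchanged, or with a leading UTF-8 BOM stripped off."""
    if sql_string.startswith(codecs.BOM_UTF8):
        return sql_string[len(codecs.BOM_UTF8):]
    return sql_string
```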
I had just noticed all of the other check_*() functions hanging out in this file, and at the time I thought I'd go with the flow! Anyway, I'm happy to rename the function and wrap up that logic all the way.
I've tried to neaten this up as per your suggestions: ca34532.
Thanks for sending this PR over - I'm hopeful that we'll get it merged in soon. Would you be able to add a few tests for your changes as well? If you make the changes to the BOM code that I suggested it should be pretty easy to write a few unit tests for that (strip when BOM is present, don't when it's not). Also the sqlprep changes should have tests as well - that code is so complex that the tests really come in handy when making changes. Thanks!
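A couple of unit tests along those lines could be as small as the following sketch; the import path and the remove_bom_if_exists name follow the rename proposed above and are assumptions, not the code as merged:

```python
import codecs
import unittest

# Assumed import path; adjust to wherever the renamed helper ends up living.
from pgsanity.pgsanity import remove_bom_if_exists


class TestRemoveBom(unittest.TestCase):
    def test_bom_is_stripped_when_present(self):
        raw = codecs.BOM_UTF8 + b"select 1;"
        self.assertEqual(b"select 1;", remove_bom_if_exists(raw))

    def test_string_without_bom_is_untouched(self):
        raw = b"select 1;"
        self.assertEqual(raw, remove_bom_if_exists(raw))


if __name__ == "__main__":
    unittest.main()
```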
…OM-table information, per a comment by the original author.
These changes look good. Do you plan on adding the flag to enable psql stripping and supporting the double-backslash to go back to sql?
Once #14 looks like it's down for the count, we can circle around to the remaining transitions, which deal with
On performance: I just did a quick benchmark, and it looks like the earlier 92d4bfe runs for about 56 seconds on a 56MB file, whereas the latest commit (0dd9de2) runs in about 187 seconds because of the character-by-character method. As I've said, things can be optimized from here. While we're on the subject, I'll note that on files of around that same ~50MB size, Python will bomb out due to memory limitations if there's less than 300MB or so available in RAM. Based on where I saw crashes (on an extremely memory-impoverished machine), the main culprits seem to be the string slice done to remove BOM-tables and also the call to ...

I don't have much experience with profiling Python code for memory consumption, and I'm guessing most folks who would use pgsanity will probably have access to decent hardware if they are doing bulk SQL operations anyway. However, if you've got specific goals in mind regarding the resource footprint, do let me know.
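As an aside on the slice-related memory spike: one way to sidestep it entirely is to check for the BOM at the file level and seek past it, so the large in-memory string never has to be copied. This is only a sketch under that assumption; open_without_bom is a hypothetical helper, not something in pgsanity:

```python
import codecs

def open_without_bom(path):
    """Open a file in binary mode, positioned just past a UTF-8 BOM if one is present.

    This avoids loading the whole file and then slicing off three bytes,
    which temporarily doubles the memory footprint for large inputs.
    """
    f = open(path, "rb")
    if f.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
        f.seek(0)  # no BOM found: rewind so the caller sees the file from byte 0
    return f
```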
Overall this looks pretty good to me. I haven't read through the main processing/transitions stuff in excruciating detail yet, but I like the overall direction. Do you think you'll be able to speed it up?
…kens on the entire input string. This should prevent memory from getting exhausted so easily.
Work and life in general have been a little nuts lately, but a belated "yep" on planning to do optimization on this. Similar to the way you'd been doing it originally, we now hop between tokens instead of going character-by-character like I did in the proof-of-concept. In 2f158f0, the new processing algorithm works and runs pretty swiftly, but the approach of trying to find all tokens ahead of time is too memory-intensive and easily wakes up the OOM killer when given large files to process. I should be able to do another stint soon to address that.
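The exact token set in the branch isn't shown here, but the general fix for the up-front token list is lazy tokenization: re.finditer yields matches one at a time, so only the current token needs to live in memory. The pattern below is a simplified stand-in, not the one used in the PR:

```python
import re

# Simplified stand-in pattern: statement terminators, psql backslash commands,
# single-quoted strings, line comments, or any other single character.
TOKEN_RE = re.compile(r";|\\[A-Za-z]+|'(?:[^']|'')*'|--[^\n]*|.", re.DOTALL)

def iter_tokens(sql_text):
    """Yield tokens one at a time instead of building the full token list up front."""
    for match in TOKEN_RE.finditer(sql_text):
        yield match.group(0)
```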
I've gotten the memory-issues under control a bit better now, I believe. Speed-wise, on the same 60MB file I mentioned earlier, processing now completes in just a hair less than 19 seconds, which I think is quite respectable. When I tried a larger ~160MB file, it appeared to hang indefinitely, which I discovered was due to
Along with those items, adding the optional
@koyae, what do you want to do here for next steps? I'd rather not add any options to let the user control/limit the chardet detection. I would rather just pick 1024 characters or something like that and leave it at that. I'm also curious to know what the performance difference is between the current stable version of pgsanity and the one with your changes. I believe your approach to parsing is better from a code-sanity perspective, and it handles some situations that the old code doesn't, but I don't want to regress on the speed in any significant way. If the new code is slightly slower (10, even 20 percent) then I think that's acceptable, but if it doubles the runtime then I think we'll need to figure something else out. Again, thank you for all of the work you have put into this so far. Please let me know how you'd like to move forward.
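For reference, the fixed-sample approach could be as small as the following; guess_encoding is a hypothetical helper name, while chardet.detect is the real API (it may return None for the encoding, hence the fallback):

```python
import chardet

def guess_encoding(path, sample_size=1024):
    """Guess a file's encoding from only its first sample_size bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    result = chardet.detect(sample)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    return result["encoding"] or "utf-8"  # fall back to utf-8 if chardet can't decide
```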
Hey again! Thanks for your patience on all this. Let me try to answer some of your questions.

Avoiding the need for optional command-line flags: That's a toughie; obviously, pgsanity can't work properly if it can't decode files correctly, but it's also nice to have a tool that works point-and-shoot without extra nuances to learn. In light of the results of the performance tests discussed further down, the prospects of having the tool be both fast and easy to use actually do look optimistic. To eke out just a touch of flexibility in case some users might need it, I'd like to suggest the following:
If the user was working with some weird charset, and pgsanity flags the syntax because a character may have parsed screwy (because of the wrong guessed encoding), the secondary path would look like this:
Performance considerations: On a decent machine, encoding scans really don't impact performance that much, and I wasn't able to reproduce the freeze I had gotten earlier from scanning a big file. With the files I tested with, it seemed we only had about a 1-second time difference between scanning the whole file versus just part of it. That's all good news. I did about 5 to 10 runs per file to see if there was much variance, but found little on the testing environment described below. When I checked for I/O wait time, I saw it was virtually 0, so I've omitted any stats about it.

Test machine: AWS m4.large
Although this machine has two cores, this currently doesn't affect our speed since pgsanity does all processing before handing anything over to ecpg. When ecpg spins up to 100% (on one core), pgsanity's CPU usage drops from 100% down to around 4%, meaning the second core only gives us a boost of around 0.04.

Specs:

Test 1: 302,950 KB (trsq_not_uploading_used_out_utm_coords.sql)
Build 3fab645:
Build 00b3977:

Test 2: 167,881 KB (07_tr_trs_trsq_trsqq_2.sql)
Build 3fab645:
Build 00b3977:
More benchmark info (using same system as described above):
As I was running these tests, I found a few files (00a.sql, 00b.sql, 00.sql, and 08.sql – not shown here) wherein the full scan either crashed out or took a very long time to complete. I'm going to see if there's anything I can glean from looking at them more carefully. These were mostly smaller files, so I believe the cause is content-related. If that's true, then length is more or less irrelevant, which means scanning only 1024 bytes may merely decrease the chances of encountering a poison sequence under

If my theory holds any water, there's a chance the
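If the poison-sequence theory pans out, one possible middle ground (purely a sketch, not something proposed in this PR) is chardet's incremental UniversalDetector, which can be fed in chunks and stopped as soon as it's confident or once a hard byte cap is hit; the cap and chunk size below are arbitrary:

```python
from chardet.universaldetector import UniversalDetector

def detect_encoding_capped(path, max_bytes=1024 * 1024, chunk_size=64 * 1024):
    """Feed chardet chunk by chunk, stopping early on confidence or at max_bytes."""
    detector = UniversalDetector()
    seen = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            seen += len(chunk)
            if detector.done or seen >= max_bytes:
                break
    detector.close()
    return detector.result  # dict with 'encoding' and 'confidence'
```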
@koyae, I'm not exactly sure how to proceed with this. You've put a ton of work into this and there are parts of it that are really good and that I think we could merge, but I'm nervous about the size of the change and the potential for disruption. I don't have a solid understanding of the performance impact and I'm not sure about the chardet issues. I think the best approach going forward would be to break this pull request up into smaller pieces that are easier to review, test, and reason about. If you agree with that approach, I think we should close out this pull request, and if you create new and smaller PRs with some of these changes broken out, we'll get them merged. What do you think?
To summarize the performance-numbers posted above:
In other words, passing 1024 characters to determine encoding is faster than passing the whole file by a margin of between 3% and 4%. The rest of the difference is due to the tokenization, which was added since your last commit.

Sidenote: I've now tested
Sidenote 2: I did a comparison, and whoever said that using a
I'd be fine breaking up the various changes as you suggest. Roughly, I'd organize it:
Does that grouping sound alright?

I made some fairly minor alterations in the interest of making pgsanity easier to use. Changes include addressing #5 for both piped and file-based input, plus an edit which allows pgsanity to deal with psql-directives that might show up in incoming data. If the latter is not desired (at least as the default behavior), I can either move those changes over to a separate branch and then submit another pull request once I've done so, or we could allow a flag to be passed that would enable the ignore-psql-directives behavior.
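For anyone skimming this later: the branch appears to handle this inside its processing/transition logic, but as a rough illustration of the idea, skipping psql meta-commands amounts to dropping lines whose first non-blank character is a backslash. strip_psql_directives below is a hypothetical standalone helper, not the code in this PR:

```python
import re

# psql meta-commands start with a backslash at the beginning of a line,
# e.g. "\copy ...", "\set ...", "\i some_file.sql".
PSQL_DIRECTIVE_RE = re.compile(r"^\s*\\\S")

def strip_psql_directives(sql_text):
    """Drop whole lines that look like psql meta-commands before validation."""
    kept = (line for line in sql_text.splitlines(keepends=True)
            if not PSQL_DIRECTIVE_RE.match(line))
    return "".join(kept)
```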